CytofIN

CytofIN (CyTOF integration) is an R package for homogenizing and integrating heterogeneous CyTOF data from diverse data sources.

Before CyTOF data integration, all CyTOF files must be homogenized so that they have consistent channels. CytofIN requires that all input CyTOF files be homogenized against a user-provided standardized panel with user-defined search patterns. To normalize the CyTOF data, CytofIN uses a novel generalized anchor strategy that defines the baseline of the signal across batches in order to correct for batch effects. The user must identify one anchor from each plate (batch). A reference anchor is generated from the mean expression of all identified anchors across plates (batches). Next, a user-specified transformation function is applied to fit each plate-specific anchor to the reference data distribution, and the same transformation is then applied to correct the sample data signal on that plate.
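
The anchor strategy described above can be illustrated with a small base R sketch. This is not the CytofIN implementation: the toy matrices, the helper make_cells, and the choice of a simple per-marker mean shift are all assumptions made for illustration.

```r
# Toy data (assumed): two plates, each with one anchor and one sample.
# Rows are cells, columns are markers; plate2 carries a batch offset of +1.
set.seed(1)
markers <- c("CD19", "CD45")
make_cells <- function(n, offset) {
  m <- matrix(rnorm(n * length(markers), mean = 2 + offset), ncol = length(markers))
  colnames(m) <- markers
  m
}
anchors <- list(plate1 = make_cells(500, 0), plate2 = make_cells(500, 1))
samples <- list(plate1 = make_cells(500, 0), plate2 = make_cells(500, 1))

# Reference anchor: per-marker mean of the plate-specific anchor means.
anchor_means <- sapply(anchors, colMeans)   # markers x plates
reference_mean <- rowMeans(anchor_means)

# Mean shift: the offset that moves a plate's anchor onto the reference
# is applied to the samples from the same plate.
normalized <- lapply(names(samples), function(p) {
  shift <- reference_mean - anchor_means[, p]
  sweep(samples[[p]], 2, shift, "+")
})
names(normalized) <- names(samples)

# Plate-to-plate gap in sample means, before and after correction.
gap_before <- abs(colMeans(samples$plate1) - colMeans(samples$plate2))
gap_after  <- abs(colMeans(normalized$plate1) - colMeans(normalized$plate2))
```

In this toy setting gap_after is much smaller than gap_before, because the offset learned from the anchors removes the shared plate effect from the samples.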

CytofIN provides four functions for CyTOF data integration:

  1. homogenize: this function performs batch homogenization of CyTOF data based on a user-defined panel and search pattern.
  2. anprep: this function generates reference statistics from the anchors identified on each plate (batch).
  3. annorm: this function performs signal normalization using a transformation function based on the anchor statistics from anprep.
  4. annorm_nrs: this function performs signal normalization using stabilized channels as internal anchors.
library(devtools)
install_github('bennyyclo/Cytofin')
#> Skipping install of 'cytofin' from a github remote, the SHA1 (a9cf729d) has not changed since last install.
#>   Use `force = TRUE` to force installation

CyTOF data homogenization

Description: the homogenize function takes a user-supplied antigen panel table, which lists each standardized antigen name together with its associated search pattern. Given CyTOF files with differing antigen names, the program runs a regular expression search to match each synonymous term against the panel and replaces the antigen name with the standardized name from the panel.

Function Definition:

homogenize(metadata_filename, panel_filename, input_file_dir, output_file_dir)

Input:

metadata_filename: metadata table of raw CyTOF files (.fcs) (must be in the current directory).

panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).

input_file_dir: folder directory containing input raw CyTOF files.

output_file_dir: folder directory containing output homogenized files.

Output: homogenized CyTOF file with user-defined channels presented in the standardized antigen table.
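
The pattern-matching idea can be sketched in base R as follows. The column names range and antigen_pattern follow the demo panel printed later in this document; the helper standardize and the raw names other than FITC_myeloid are hypothetical, and the package's actual matching logic may differ.

```r
# A three-row excerpt in the shape of the demo panel (standardized name + regex).
panel <- data.frame(
  range = c("CD45", "CD33", "TdT"),
  antigen_pattern = c("CD45", "FITC|CD33", "Td"),
  stringsAsFactors = FALSE
)

# Antigen names as they might appear in a raw file (partly assumed).
raw_names <- c("FITC_myeloid", "CD45", "TdT_extra")

# Hypothetical helper: return the standardized name of the first panel
# pattern that matches each raw name, or NA when nothing matches.
standardize <- function(raw, panel) {
  vapply(raw, function(nm) {
    hit <- which(vapply(panel$antigen_pattern, grepl, logical(1), x = nm))
    if (length(hit) >= 1) panel$range[hit[1]] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

standardized <- standardize(raw_names, panel)
```

Taking the first match is a design choice of this sketch; overlapping patterns in a real panel would need a more deliberate tie-breaking rule.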

CyTOF data normalization using external anchors

The external anchor normalization steps include: 1. preparation of external anchors and 2. application of transformation function.

  1. Anchor preparation:

Description: the anprep function concatenates the identified anchor files (one file per plate/batch) and then generates summary statistics, including mean and variance, which are used for batch correction.

Function definition:

anprep(metadata_filename, panel_filename, input_file_dir)

Input:

metadata_filename: metadata table of anchor CyTOF files (.fcs)(must be in the current directory).

panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).

input_file_dir: folder directory containing output data.

Output: an RData object containing reference statistics of the universal reference and concatenated anchor FCS files.

The RData object stores the following variables describing the universal reference. These variables are exported from the RData object for subsequent batch normalization:

mean_uni: a 1-dimensional array of mean expression for all input markers

mean_var: a 1-dimensional array of mean variance values of all marker expressions.

mean_uni_mean: the mean value of mean_uni array.

mean_uni_var: the mean value of mean_var array.
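
To make the relationship between these four variables concrete, here is a base R sketch that computes quantities with the same names from toy anchor matrices. The averaging across plates shown here is an assumption based on the descriptions above, not a copy of what anprep does internally.

```r
# Toy anchors (assumed): one matrix per plate, rows = cells, columns = markers.
set.seed(42)
markers <- c("CD19", "CD45", "CD34")
anchors <- lapply(c(plate1 = 0, plate2 = 0.5, plate3 = 1), function(offset) {
  m <- matrix(rnorm(300 * length(markers), mean = 3 + offset), ncol = length(markers))
  colnames(m) <- markers
  m
})

# Per-plate summaries, arranged as markers x plates.
per_plate_mean <- sapply(anchors, colMeans)
per_plate_var  <- sapply(anchors, function(m) apply(m, 2, var))

# One value per marker, averaged over plates.
mean_uni <- rowMeans(per_plate_mean)
mean_var <- rowMeans(per_plate_var)

# Single scalars summarizing all markers.
mean_uni_mean <- mean(mean_uni)
mean_uni_var  <- mean(mean_var)
```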

  2. Data transformation:

Description: the annorm function applies one of several transformation functions (modes) to normalize each anchor to the reference statistics generated by the anprep function.

Function definition:

annorm(control_metadata_filename, control_data_filename, sample_metadata_filename, panel_filename, input_file_dir, val_file_dir="none", output_file_dir, mode)

Input:

control_metadata_filename: metadata table of anchor CyTOF files (.fcs) (must be in the current directory).

control_data_filename: RData object containing anchor reference statistics (must be in the current directory).

sample_metadata_filename: metadata table of homogenized CyTOF files (.fcs)(must be in the current directory).

panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).

input_file_dir: folder directory containing input homogenized CyTOF data file.

val_file_dir: folder directory containing validation homogenized CyTOF data file (optional).

output_file_dir: folder directory containing output normalized CyTOF data file.

mode: transformation function used for normalization (meanshift, meanshift_bulk, variance, z_score, beadlike).

Output: normalized CyTOF files.
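
The exact formulas behind each mode are not given here. As a rough guide to how two such transformations could differ, the sketch below contrasts a pure location shift with a location-and-scale adjustment on a single channel. The function normalize_channel and both formulas are assumptions chosen to match the mode names, and the remaining modes are deliberately left out.

```r
# Hypothetical single-channel transformations; formulas are assumptions,
# not taken from the CytofIN source.
normalize_channel <- function(x, mode, anchor_mean, anchor_sd, ref_mean, ref_sd) {
  switch(mode,
    # Move the plate anchor's mean onto the reference mean.
    meanshift = x + (ref_mean - anchor_mean),
    # Standardize against the plate anchor, then rescale to the reference.
    z_score = (x - anchor_mean) / anchor_sd * ref_sd + ref_mean,
    stop("mode not covered by this sketch: ", mode)
  )
}

x <- c(1, 3, 5)  # sample signal on one plate (toy values)
shifted  <- normalize_channel(x, "meanshift", anchor_mean = 4, anchor_sd = 2,
                              ref_mean = 5, ref_sd = 1)
rescaled <- normalize_channel(x, "z_score", anchor_mean = 4, anchor_sd = 2,
                              ref_mean = 5, ref_sd = 1)
```

The shift preserves the spread of the sample signal, while the rescaled version also compresses or stretches it to match the reference variance.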

CyTOF data normalization using internal anchors

Description: when external references are not available, internal anchors can be used instead. Here, the most stable channels are identified as internal anchors using a PCA-based non-redundancy score (NRS). A minimum of three channels should be selected to establish an internal reference against which the signal can be calibrated between CyTOF files.

Function definition:

annorm_nrs(sample_metadata_filename, panel_filename, input_file_dir, val_file_dir="none", output_file_dir, nchannels)

Input:

sample_metadata_filename: metadata table of homogenized CyTOF files (.fcs) (must be in the current directory).

panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).

input_file_dir: folder directory containing input homogenized CyTOF data file.

val_file_dir: folder directory containing validation homogenized CyTOF data file (optional).

output_file_dir: folder directory containing output normalized CyTOF data file.

nchannels: number of stabilized channels used for normalization.

Output: normalized CyTOF files.
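
A PCA-based non-redundancy score can be sketched in base R as below. The definition used here (absolute loadings on the leading principal components, weighted by their eigenvalues) is the one commonly used in CyTOF analysis workflows; whether annorm_nrs uses the same definition, and the rule of picking the lowest-scoring channels as the stable ones, are assumptions of this sketch.

```r
# Toy expression matrix (assumed): three channels vary with a cell
# population, three stay nearly constant.
set.seed(7)
n <- 1000
group <- rep(c(0, 3), each = n / 2)
expr <- cbind(
  CD19 = rnorm(n, mean = group),
  CD34 = rnorm(n, mean = group / 2),
  CD22 = rnorm(n, mean = group * 0.8),
  CD45 = rnorm(n, mean = 5, sd = 0.3),
  pS6  = rnorm(n, mean = 4, sd = 0.25),
  DNA1 = rnorm(n, mean = 6, sd = 0.2)
)

# NRS per channel: sum over the first ncomp components of
# |loading| * eigenvalue.
nrs <- function(x, ncomp = 3) {
  pca <- prcomp(x, center = TRUE, scale. = FALSE)
  k <- seq_len(min(ncomp, ncol(x)))
  drop(abs(pca$rotation[, k, drop = FALSE]) %*% pca$sdev[k]^2)
}

scores <- nrs(expr)

# Assumption: the channels with the lowest scores are the most stable.
nchannels <- 3
stable_channels <- names(sort(scores))[seq_len(nchannels)]
```

On this toy matrix the three low-variance channels receive the smallest scores, so they would be the candidates for internal anchors.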

Computational pipeline for CyTOF data integration

Below is a demo R script that uses the CytofIN package for CyTOF data integration.

First, let's homogenize the panel of all FCS files using the homogenize function:

#import cytofin R package
library(cytofin)

#homogenization antigen panel, use the demo data supplied with the package
metadata_filename <- paste0(path.package("cytofin"),"/extdata/test_metadata_raw.csv")
panel_filename <- paste0(path.package("cytofin"),"/extdata/test_panel.csv")
input_file_dir_homogenize <- paste0(path.package("cytofin"),"/extdata/test_raw_fcs_files/")
output_file_dir_homogenize <- "out_test/"
homogenize(metadata_filename, panel_filename, input_file_dir_homogenize, output_file_dir_homogenize)
#> Warning in dir.create(output_file_dir): 'out_test' already exists
#>                             filename  cohort plate_number patient_id condition
#> 1       ALL05v2_Plate2_UPN94 das.fcs ALL05v2       plate2      UPN94       Das
#> 2       ALL08_Plate8_UPN26 basal.fcs   ALL08       plate8      UPN26     Basal
#> 3  CRLF2_Plate1_UPN53 das + TSLP.fcs   CRLF2       plate1      UPN53  das_TSLP
#> 4  ALL05v2_Plate2_healthy basal1.fcs ALL05v2       plate2    Healthy     Basal
#> 5   ALL08_Plate8_Healthy03 basal.fcs   ALL08       plate8  Healthy03     Basal
#> 6    CRLF2_Plate1_Healthy 04 BCR.fcs   CRLF2       plate1  Healthy04       BCR
#> 7          MS_Plate5_SU978 Basal.fcs  MajSak       plate5      SU978     Basal
#> 8           MS_Plate5_Healthy BM.fcs  MajSak       plate5    Healthy        BM
#> 9       SJ_Plate2_TB010950_Basal.fcs  StJude       plate2   TB010950     Basal
#> 10          SJ_Plate2_Healthy_BM.fcs  StJude       plate2    Healthy        BM
#>    population                                    validation
#> 1        <NA>      homogenized_ALL05v2_plate2_UPN94 das.fcs
#> 2        <NA>      homogenized_ALL08_plate8_UPN26 basal.fcs
#> 3        <NA> homogenized_CRLF2_plate1_UPN53 das + TSLP.fcs
#> 4           1 homogenized_ALL05v2_plate2_healthy basal1.fcs
#> 5        <NA>  homogenized_ALL08_plate9_Healthy03 basal.fcs
#> 6        <NA>   homogenized_CRLF2_plate1_Healthy 04 BCR.fcs
#> 7        <NA>     homogenized_MajSak_plate5_SU978 Basal.fcs
#> 8        <NA>      homogenized_MajSak_plate5_Healthy BM.fcs
#> 9        <NA>  homogenized_StJude_plate2_TB010950_Basal.fcs
#> 10       <NA>      homogenized_StJude_plate2_Healthy_BM.fcs
#>            desc        range metal_pattern antigen_pattern Lineage Functional
#> 1          Time         Time       [Tt]ime         [Tt]ime       0          0
#> 2  Event_length Event_length         ength           ength       0          0
#> 3     (Pd102)Di          BC1         Pd102             BC1       0          0
#> 4     (Pd104)Di          BC2         Pd104             BC2       0          0
#> 5     (Pd105)Di          BC3         Pd105             BC3       0          0
#> 6     (Pd106)Di          BC4         Pd106             BC4       0          0
#> 7     (Pd108)Di          BC5         Pd108             BC5       0          0
#> 8     (Pd110)Di          BC6         Pd110             BC6       0          0
#> 9     (In113)Di   CD235_CD61         In113           CD235       1          0
#> 10    (In115)Di         CD45         In115            CD45       1          0
#> 11    (La139)Di        cPARP         La139            PARP       0          1
#> 12    (Pr141)Di     pPLCg1_2         Pr141        pPLCg1_2       0          1
#> 13    (Nd142)Di         CD19         Nd142            CD19       1          0
#> 14    (Nd143)Di         CD22         Nd143            CD22       1          0
#> 15    (Nd144)Di       p4EBP1         Nd144          p4EBP1       0          1
#> 16    (Nd145)Di      tIkaros         Nd145         tIkaros       1          0
#> 17    (Nd146)Di        CD79b         Nd146           CD79b       1          0
#> 18    (Sm147)Di         CD20      [PS]m147            CD20       1          0
#> 19    (Nd148)Di         CD34         Nd148            CD34       1          0
#> 20    (Sm149)Di       CD179a         Sm149          CD179a       1          0
#> 21    (Nd150)Di       pSTAT5         Nd150          pSTAT5       0          1
#> 22    (Sm152)Di         Ki67         Sm152            Ki67       0          1
#> 23    (Eu153)Di         IgMi         Eu153            IgMi       1          0
#> 24    (Sm154)Di Kappa_lambda         Sm154            appa       0          1
#> 25    (Gd156)Di         CD10         Gd156            CD10       1          0
#> 26    (Gd158)Di       CD179b         Gd158          CD179b       1          0
#> 27    (Gd160)Di         CD24         Gd160            CD24       1          0
#> 28    (Dy161)Di        TSLPr         Dy161           TSLPr       0          1
#> 29    (Dy162)Di        CD127         Dy162           CD127       1          0
#> 30    (Dy163)Di         RAG1         Dy163            RAG1       1          0
#> 31    (Dy164)Di          TdT         Dy164              Td       1          0
#> 32    (Ho165)Di         Pax5         Ho165            Pax5       1          0
#> 33    (Er166)Di         pSyk         Er166            pSyk       0          1
#> 34    (Er167)Di         CD43         Er167            CD43       1          0
#> 35    (Er168)Di         CD38         Er168            CD38       1          0
#> 36    (Er170)Di          CD3         Er170            CD3^       1          0
#> 37    (Yb171)Di         CD33         Yb171       FITC|CD33       0          1
#> 38    (Yb172)Di          pS6         Yb172             pS6       0          1
#> 39    (Yb173)Di         pErk         Yb173            pErk       0          1
#> 40    (Yb174)Di        HLADR         Yb174           HLADR       1          0
#> 41    (Lu175)Di         IgMs         Lu175            IgMs       1          0
#> 42    (Yb176)Di        pCreb     [YbLu]176           pCreb       0          1
#> 43    (Ir191)Di         DNA1         Ir191            DNA1       0          1
#> 44    (Ir193)Di         DNA2         Ir193            DNA2       0          1
#>    General
#> 1        1
#> 2        1
#> 3        1
#> 4        1
#> 5        1
#> 6        1
#> 7        1
#> 8        1
#> 9        0
#> 10       0
#> 11       0
#> 12       0
#> 13       0
#> 14       0
#> 15       0
#> 16       0
#> 17       0
#> 18       0
#> 19       0
#> 20       0
#> 21       0
#> 22       0
#> 23       0
#> 24       0
#> 25       0
#> 26       0
#> 27       0
#> 28       0
#> 29       0
#> 30       0
#> 31       0
#> 32       0
#> 33       0
#> 34       0
#> 35       0
#> 36       0
#> 37       0
#> 38       0
#> 39       0
#> 40       0
#> 41       0
#> 42       0
#> 43       0
#> 44       0
#> uneven number of tokens: 529
#> The last keyword is dropped.
#> uneven number of tokens: 529
#> The last keyword is dropped.
#> filename: ALL05v2_Plate2_UPN94 das.fcs 
#> 1 
#> matched data_antigen: Time ref_antigen: Time ref_antigen_pattern [Tt]ime 
#> 2 
#> matched data_antigen: Event_length ref_antigen: Event_length ref_antigen_pattern ength 
#> 3 
#> matched data_antigen: BC1 ref_antigen: BC1 ref_antigen_pattern BC1 
#> 4 
#> matched data_antigen: BC2 ref_antigen: BC2 ref_antigen_pattern BC2 
#> 5 
#> matched data_antigen: BC3 ref_antigen: BC3 ref_antigen_pattern BC3 
#> 6 
#> matched data_antigen: BC4 ref_antigen: BC4 ref_antigen_pattern BC4 
#> 7 
#> matched data_antigen: BC5 ref_antigen: BC5 ref_antigen_pattern BC5 
#> 8 
#> matched data_antigen: BC6 ref_antigen: BC6 ref_antigen_pattern BC6 
#> 9 
#> matched data_antigen: CD235_CD61 ref_antigen: CD235_CD61 ref_antigen_pattern CD235 
#> 10 
#> matched data_antigen: CD45 ref_antigen: CD45 ref_antigen_pattern CD45 
#> 11 
#> matched data_antigen: cPARP ref_antigen: cPARP ref_antigen_pattern PARP 
#> 12 
#> matched data_antigen: pPLCg1_2 ref_antigen: pPLCg1_2 ref_antigen_pattern pPLCg1_2 
#> 13 
#> matched data_antigen: CD19 ref_antigen: CD19 ref_antigen_pattern CD19 
#> 14 
#> matched data_antigen: CD22 ref_antigen: CD22 ref_antigen_pattern CD22 
#> 15 
#> matched data_antigen: p4EBP1 ref_antigen: p4EBP1 ref_antigen_pattern p4EBP1 
#> 16 
#> matched data_antigen: tIkaros ref_antigen: tIkaros ref_antigen_pattern tIkaros 
#> 17 
#> matched data_antigen: CD79b ref_antigen: CD79b ref_antigen_pattern CD79b 
#> 18 
#> matched data_antigen: CD20 ref_antigen: CD20 ref_antigen_pattern CD20 
#> 19 
#> matched data_antigen: CD34 ref_antigen: CD34 ref_antigen_pattern CD34 
#> 20 
#> matched data_antigen: CD179a ref_antigen: CD179a ref_antigen_pattern CD179a 
#> 21 
#> matched data_antigen: pSTAT5 ref_antigen: pSTAT5 ref_antigen_pattern pSTAT5 
#> 22 
#> matched data_antigen: Ki67 ref_antigen: Ki67 ref_antigen_pattern Ki67 
#> 23 
#> matched data_antigen: IgMi ref_antigen: IgMi ref_antigen_pattern IgMi 
#> 24 
#> matched data_antigen: Kappa_lambda ref_antigen: Kappa_lambda ref_antigen_pattern appa 
#> 25 
#> matched data_antigen: CD10 ref_antigen: CD10 ref_antigen_pattern CD10 
#> 26 
#> matched data_antigen: CD179b ref_antigen: CD179b ref_antigen_pattern CD179b 
#> 27 
#> matched data_antigen: CD24 ref_antigen: CD24 ref_antigen_pattern CD24 
#> 28 
#> matched data_antigen: TSLPr ref_antigen: TSLPr ref_antigen_pattern TSLPr 
#> 29 
#> matched data_antigen: CD127 ref_antigen: CD127 ref_antigen_pattern CD127 
#> 30 
#> matched data_antigen: RAG1 ref_antigen: RAG1 ref_antigen_pattern RAG1 
#> 31 
#> matched data_antigen: TdT ref_antigen: TdT ref_antigen_pattern Td 
#> 32 
#> matched data_antigen: Pax5 ref_antigen: Pax5 ref_antigen_pattern Pax5 
#> 33 
#> matched data_antigen: pSyk ref_antigen: pSyk ref_antigen_pattern pSyk 
#> 34 
#> matched data_antigen: CD43 ref_antigen: CD43 ref_antigen_pattern CD43 
#> 35 
#> matched data_antigen: CD38 ref_antigen: CD38 ref_antigen_pattern CD38 
#> 36 
#> matched data_antigen:  ref_antigen: CD3 ref_antigen_pattern CD3^ 
#> 37 
#> matched data_antigen: FITC_myeloid ref_antigen: CD33 ref_antigen_pattern FITC|CD33 
#> 38 
#> matched data_antigen: pS6 ref_antigen: pS6 ref_antigen_pattern pS6 
#> 39 
#> matched data_antigen: pErk ref_antigen: pErk ref_antigen_pattern pErk 
#> 40 
#> matched data_antigen: HLADR ref_antigen: HLADR ref_antigen_pattern HLADR 
#> 41 
#> matched data_antigen: IgMs ref_antigen: IgMs ref_antigen_pattern IgMs 
#> 42 
...

This step homogenized the marker names of your target samples to the names indicated in your test_panel.csv file.
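
When running this step on your own data, it can help to confirm beforehand that every file named in the metadata table is present in the input directory. The sketch below shows one way to do that in base R; the filename column matches the demo metadata printed above, while the temporary directory and the two file names are made up for illustration.

```r
# Stand-in input directory with a single (empty) file; assumed for the demo.
input_dir <- file.path(tempdir(), "raw_fcs_demo")
dir.create(input_dir, showWarnings = FALSE)
file.create(file.path(input_dir, "plate1_sample.fcs"))

# Stand-in metadata table with the same filename column as the demo metadata.
metadata <- data.frame(
  filename = c("plate1_sample.fcs", "plate2_sample.fcs"),
  plate_number = c("plate1", "plate2"),
  stringsAsFactors = FALSE
)

# Files listed in the metadata but absent from the input directory.
missing_files <- metadata$filename[!file.exists(file.path(input_dir, metadata$filename))]
if (length(missing_files) > 0) {
  message("Missing from input directory: ", paste(missing_files, collapse = ", "))
}
```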

Next, we will generate an RData object which contains statistics of the generalized anchors.

#prep external anchor 
anchor_metadata_filename <- paste0(path.package("cytofin"),"/extdata/test_anchor_metadata_raw.csv")
input_file_dir_anprep <- output_file_dir_homogenize #use the homogenized files
anprep(anchor_metadata_filename, panel_filename, input_file_dir_anprep)
#> [1] "concatenated_control_untransformed.fcs"

Next, perform batch normalization with the CytofIN annorm function, using healthy control samples as generalized anchors:

#data normalization using external anchors and meanshift transformation function
val_file_dir <- paste0(path.package("cytofin"),"/extdata/test_batch_fcs_files/")
anchor_data_filename <- "./Prep_control.RData"
output_file_dir_annorm <- "norm_test/"
mode <- "meanshift"
annorm(anchor_metadata_filename, anchor_data_filename, metadata_filename, panel_filename, 
input_file_dir_anprep, val_file_dir, output_file_dir_annorm, mode)
#> Warning in dir.create(output_file_dir): 'norm_test' already exists
#> ALL05v2_Plate2_UPN94 das.fcs

#> ALL08_Plate8_UPN26 basal.fcs

#> CRLF2_Plate1_UPN53 das + TSLP.fcs

#> ALL05v2_Plate2_healthy basal1.fcs

#> ALL08_Plate8_Healthy03 basal.fcs

#> CRLF2_Plate1_Healthy 04 BCR.fcs

#> MS_Plate5_SU978 Basal.fcs

#> MS_Plate5_Healthy BM.fcs

#> SJ_Plate2_TB010950_Basal.fcs

#> SJ_Plate2_Healthy_BM.fcs

In the plots generated for each file, normalization has improved a sample when the normalized signal (dark green) moves closer to the validation signal (purple).

Now let us try to normalize with stable channels as generalized anchors:

#data normalization using 4 internal channels and meanshift_bulk transformation function
nchannels <- 4
output_file_dir_annorm_nrs <- "norm_test2/"
annorm_nrs(metadata_filename, panel_filename, input_file_dir_anprep, val_file_dir, 
output_file_dir_annorm_nrs, nchannels)
#> Warning in dir.create(output_file_dir): 'norm_test2' already exists
#> Warning: `fun.y` is deprecated. Use `fun` instead.

#> ALL05v2_Plate2_UPN94 das.fcs

#> ALL08_Plate8_UPN26 basal.fcs

#> CRLF2_Plate1_UPN53 das + TSLP.fcs

#> ALL05v2_Plate2_healthy basal1.fcs

#> ALL08_Plate8_Healthy03 basal.fcs

#> CRLF2_Plate1_Healthy 04 BCR.fcs

#> MS_Plate5_SU978 Basal.fcs

#> MS_Plate5_Healthy BM.fcs

#> SJ_Plate2_TB010950_Basal.fcs

#> SJ_Plate2_Healthy_BM.fcs